1. INTRODUCTION

Microarray technology is a branch of biotechnology that aims to study the expression of
genes in the cell [1]. It places gene sequences on a glass slide called a gene chip. The gene chip is
designed to display sequences of deoxyribonucleic acid (DNA) or ribonucleic acid (RNA).
Complementary base pairing between the sample and the gene sequences on the chip produces different
colours according to the expression level of each gene. Microarray technology allows
researchers to analyse thousands of gene expression profiles simultaneously [2-5]. The datasets produced by
microarray technology are known as gene expression datasets [2]. As a result, biomedical research, especially
cancer research, has increased. However, the high dimensionality of these datasets also affects the results of
research: because a microarray dataset has so many dimensions, classification and computation
become more complex when studying gene expression characteristics [6].

Besides that, microarray datasets contain many irrelevant attributes, and missing values may occur
after the initial collection of the dataset. Both can reduce the accuracy of a classification algorithm.
Hence, data pre-processing is a mandatory step before the dataset can be applied
to other mainstream research algorithms [7].

In the next section, we introduce the gene expression dataset used and its
information, followed by the method used to pre-process the dataset. Section 3 presents the outcome of
pre-processing and compares the dataset before and after pre-processing.
Lastly, Section 4 concludes the paper.


2. RESEARCH MATERIAL & METHODOLOGY 

In this section, the material and methodology applied in the study are explained.
A gene expression dataset is used as the input for data pre-processing. In this study,
Bioconductor and significant directed random walk (sDRW) are used to pre-process the data before
it is used to predict and classify genes.


2.1. Microarray Data 

A gene expression dataset is a dataset produced by microarray technology. Such datasets can be accessed from
the National Center for Biotechnology Information (NCBI). In this research, GSE10072 [8] is downloaded as a
raw file. The platform used to prepare this Affymetrix microarray gene expression dataset is GPL96. The sample
identifiers (IDs) of this lung cancer dataset range from GSM254625 to GSM254731. GSE10072 is a
lung cancer sample set. It has 107 samples, of which 58 are cancerous samples and 49 are normal
samples. Overall, GSE10072 has 13788 genes.
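
As a quick consistency check on these figures, the short Python sketch below enumerates the GSM accession range and compares it with the stated sample counts. It assumes the accessions are consecutive integers, which the text implies but does not state; the variable names are illustrative only.

```python
# Assumption: GSM accessions in GSE10072 are consecutive between the two endpoints.
first_id, last_id = 254625, 254731
sample_ids = [f"GSM{n}" for n in range(first_id, last_id + 1)]

n_cancerous, n_normal = 58, 49  # class counts stated in the text

# The accession range and the class counts should describe the same 107 samples.
assert len(sample_ids) == n_cancerous + n_normal == 107
```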


2.2. Methodology 

A raw file of gene expression data contains an abundance of information extracted from the
cell. This raw file needs to be processed so that the right attributes can be extracted for subsequent
study. The R programming language is chosen, and packages built in R
are used to pre-process the dataset. In our study, the Bioconductor package is downloaded and imported
for the purpose of data pre-processing [9]. Bioconductor analyses the expression values and further
arranges the dataset using normalization, which narrows the range of the data to be studied. In this study,
the dataset undergoes 3 pre-processing stages before being applied to classification algorithms
such as the genetic algorithm [10], pathway-based cancer classification [11], and significant directed random
walk (sDRW) [12]. Figure 1 illustrates the phases of the data pre-processing stage.
Figure 1. Data pre-processing of gene expression dataset


The first 2 steps run under Bioconductor, while the last step runs under sDRW.
First, the data are cleaned by removing unwanted attributes, removing missing values, and arranging the
dataset properly. Figure 2 shows the details of this first step of data pre-processing. Unwanted attributes and the
samples that have missing values are removed. The data are then rearranged into the required
format before proceeding to the next phases. Besides expression values, the gene expression dataset
includes other information (attributes), such as patient biological information and dataset information.
None of this information is used in sDRW for cancer classification
purposes [13]. Hence, these attributes are considered unwanted and are removed from the
dataset. Only wanted attributes are kept for cancer classification [14], [15]. Examples of
wanted attributes are the expression values, the positions of genes, the means of the genes’ weights, and so on.
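
The attribute selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the Bioconductor workflow used in the study: the record layout, the column names (`x`, `y`, `mean`, `stdv`, `npixels`, `patient_age`, `series_title`) and the helper `keep_wanted` are all hypothetical.

```python
# Hypothetical wanted attributes: gene position plus weight-related statistics.
WANTED = ("x", "y", "mean", "stdv", "npixels")

def keep_wanted(record, wanted=WANTED):
    """Return a copy of `record` containing only the wanted attributes."""
    return {key: record[key] for key in wanted if key in record}

# One illustrative raw record mixing probe data with patient/dataset metadata.
raw = {
    "x": 0, "y": 1, "mean": 152.3, "stdv": 21.5, "npixels": 16,
    "patient_age": 63, "series_title": "lung cancer study",
}

cleaned = keep_wanted(raw)
print(cleaned)
```

Using an explicit list of wanted attributes, rather than a list of attributes to drop, means that any unexpected metadata field in a raw file is discarded by default.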

Attributes that have missing values are excluded because, without a measured expression value,
the actual expression level of the deoxyribonucleic acid (DNA) sequence cannot be determined, and the result
would be affected if a predicted value were substituted for it. Hence, attributes that contain missing values are
eliminated. Besides that, the dataset is arranged properly. DNA sequences follow one another,
and the “X” and “Y” values determine the position of each gene in the sequence of DNA.
This is important for the next process in sDRW because it
helps in determining the next gene, so that further referencing can be made by comparing with
other reference data such as Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways and
protein-protein interaction (PPI) sequences. In addition, 3 further values (mean, standard deviation and
npixels) are used as additional attributes in the cancer prediction and classification process. The mean is defined
as the average of the gene’s weight. The standard deviation is the parameter used to
quantify the amount of variation in the gene’s weight. The npixels value is the linear dimension of the gene in
pixels.
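
The removal of missing values and the arrangement by position can be sketched as follows, in Python rather than R and under stated assumptions: each probe is a dictionary, a missing value is represented by `None`, and the sort key `(x, y)` is one possible ordering, since the text does not say which axis varies first. The function names are hypothetical.

```python
def drop_missing(records, required=("x", "y", "mean", "stdv", "npixels")):
    """Keep only records in which every required attribute has a value."""
    return [r for r in records if all(r.get(k) is not None for k in required)]

def arrange_by_position(records):
    """Order records by their (x, y) position; the key order is an assumption."""
    return sorted(records, key=lambda r: (r["x"], r["y"]))

# Illustrative probe records; the values are made up.
probes = [
    {"x": 1, "y": 0, "mean": 98.0, "stdv": 10.2, "npixels": 16},
    {"x": 0, "y": 1, "mean": None, "stdv": 7.9, "npixels": 16},  # missing value
    {"x": 0, "y": 0, "mean": 152.3, "stdv": 21.5, "npixels": 16},
]

arranged = arrange_by_position(drop_missing(probes))
print([(p["x"], p["y"]) for p in arranged])  # [(0, 0), (1, 0)]
```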
Figure 2. Step 1, remove unwanted attributes, missing value & proper arrangement


Second, normalization is applied to scale down large values so that the weights of the genes fall
into the range 0 to 10 [16]. During the normalization phase, the gene’s weight, mean, and standard
deviation are used to calculate the normalized value. Normalized values greater than 10 are
removed, as only values within 0 to 10 are kept. This removes the insignificant genes.
Figure 3 illustrates the steps in normalization.
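
The text does not give the normalization formula, so the Python sketch below uses a stand-in: the absolute z-score of each weight, computed from the mean and standard deviation of all weights, followed by removal of values above 10. The formula, the function name `normalise` and the sample data are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def normalise(weights, upper=10.0):
    """Scale weights by mean and standard deviation, then drop values above `upper`.

    Assumption: the normalised value is the absolute z-score |w - mean| / sd.
    """
    mu = mean(weights.values())
    sd = pstdev(weights.values())
    if sd == 0:
        raise ValueError("standard deviation is zero; cannot normalise")
    scaled = {gene: abs(w - mu) / sd for gene, w in weights.items()}
    return {gene: v for gene, v in scaled.items() if v <= upper}

# 200 hypothetical genes with the same weight plus one extreme value.
weights = {f"gene_{i}": 100.0 for i in range(200)}
weights["outlier"] = 1_000_000.0

normalised = normalise(weights)
print(len(normalised), "outlier" in normalised)  # 200 False
```

In this example the extreme value has an absolute z-score of about 14.1 and is therefore removed, while the remaining 200 values fall well inside the 0 to 10 range.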
Figure 3. Step 2, normalization





Lastly, a filtering method [17] is applied to select the genes that have a p-value less than 0.05,
because the p-value indicates a gene’s significance towards cancer mutation. Figure 4 shows the steps in the filtering
method. This step takes place in sDRW.
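
The filtering step reduces to a threshold on per-gene p-values. The Python sketch below assumes the p-values have already been computed elsewhere (the statistical test itself is not shown), and the gene identifiers and the helper `filter_significant` are hypothetical.

```python
ALPHA = 0.05  # significance threshold stated in the text

def filter_significant(p_values, alpha=ALPHA):
    """Keep genes whose p-value is strictly below `alpha`."""
    return {gene: p for gene, p in p_values.items() if p < alpha}

# Hypothetical gene identifiers and p-values, for illustration only.
p_values = {"gene_a": 0.001, "gene_b": 0.049, "gene_c": 0.05, "gene_d": 0.73}

significant = filter_significant(p_values)
print(sorted(significant))  # ['gene_a', 'gene_b']
```

The comparison is strict, so a gene with a p-value of exactly 0.05 is not selected, matching the "less than 0.05" criterion.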
Figure 4. Step 3, filtering method
After the data pre-processing stage, the cleaned dataset is ready to be
applied in the evaluation algorithm and further implemented in a classifier. Data pre-processing not
only cleans the dataset for implementation but also allows researchers to select
the right attributes, which are the key influences for the study. In sDRW [12], Seah holds that the weight of
genes plays an important role in tumour formation; hence, during the data pre-processing stage,
he focused on those attributes that are related to the gene’s weight.


3. RESULTS AND ANALYSIS 

To showcase data pre-processing with Bioconductor and sDRW, the lung cancer dataset
GSE10072 is used as an example. The dataset is processed step by step in the sequence
described in the methodology. Originally, a raw file contains much information about the dataset,
including information that is unwanted for the cancer classification process. After data pre-processing,
only wanted attributes are kept and processed further in sDRW for cancerous gene prediction and
cancer classification. Figure 5 shows part of the dataset in its raw form, while Figure 6
illustrates the dataset after data pre-processing.
Figure 5. Visualization of GSE10072 CEL file
Figure 6. Visualization of GSE10072 after data pre-processing


By comparing Figure 5 and Figure 6, we can clearly see the difference in data arrangement.
In Figure 5, the dataset is arranged in 2 rows and the only way to differentiate the
values is the spacing applied. There are also unwanted attributes such as patient biological information,
dataset information and so on. In Figure 6, the dataset is arranged in sequence and more rows are used
to differentiate between attributes. The right attributes play an important role in the algorithm because
they ease its running process.


4. CONCLUSION 

For cancer classification purposes, we presented the data pre-processing stage using a
gene expression dataset as an example. A gene expression dataset contains many attributes and much biological
information about the samples. Hence, choosing the right attributes that could affect the algorithm is a
key step that should not be neglected. The data pre-processing stages allow researchers to craft the
dataset as intended. In this stage, the data are cleaned and turned into the form that
is needed for the next machine learning process. For instance, we focus only on the selection of attributes that
are weight-related.

Another direction of future research is to combine the cleaned data with other similar datasets
to increase the number of samples. The number of samples in the cleaned data after the
pre-processing stage is lower than in the original dataset. Hence, it may be combined with other
similar datasets to produce more samples. It is common for a
multitude of biological datasets to share the same features but to have been collected by different researchers
under different experimental conditions. Though they may display different underlying distributions,
they share highly relevant information. Each cleaned dataset contains limited samples but high-dimensional
gene expression values, which is insufficient to train a good classifier. In such cases, transfer learning
is one possible way to borrow more samples between datasets.
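
As a first step towards such a combination, two cleaned datasets can be restricted to the genes they share and their samples pooled. The Python sketch below illustrates only this pooling step, not transfer learning itself, which would additionally have to account for the differing distributions. The data layout (sample ID mapped to a gene-to-value dictionary), the function name and all identifiers are hypothetical, and sample IDs are assumed to be unique across the two datasets.

```python
def pool_on_shared_genes(dataset_a, dataset_b):
    """Merge two non-empty {sample_id: {gene: value}} datasets on shared genes."""
    genes_a = set.intersection(*(set(s) for s in dataset_a.values()))
    genes_b = set.intersection(*(set(s) for s in dataset_b.values()))
    shared = sorted(genes_a & genes_b)
    pooled = {}
    for dataset in (dataset_a, dataset_b):
        for sample_id, values in dataset.items():
            # Keep only the genes measured in every sample of both datasets.
            pooled[sample_id] = {gene: values[gene] for gene in shared}
    return shared, pooled

# Two small made-up datasets that overlap on genes g2 and g3.
first = {
    "s1": {"g1": 1.0, "g2": 2.0, "g3": 3.0},
    "s2": {"g1": 1.5, "g2": 2.5, "g3": 3.5},
}
second = {"t1": {"g2": 0.2, "g3": 0.3, "g4": 0.4}}

shared, pooled = pool_on_shared_genes(first, second)
print(shared, len(pooled))  # ['g2', 'g3'] 3
```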